Introduction to the dataset¶
Insurance Dataset Analysis Report
Overview
This report presents an analysis of a synthetic insurance dataset using visualization and statistical tools to uncover patterns and relationships between key variables. The dataset is simulated based on real-world data to ensure privacy while maintaining realistic patterns for analysis.
Objective
The primary objectives of this analysis are:
Hypothesis 1: To assess whether the frequency of claims affects the premium amount.
Hypothesis 2: To determine whether a policyholder's credit score has a direct impact on the premium amount.
To provide insights that may enhance risk assessment and inform decision-making for insurance providers.
Dataset Summary
Size: 10,000 rows × 27 columns
Missing values: none
Age range: 18 to 90 years
Average age: approximately 40 years
Initial Observations
The age distribution is right-skewed rather than normal, owing to the high volume of young policyholders. The most frequent age is 18, with 822 policyholders. The majority of policyholders have no claims, and very few have more than three claims.
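A quick way to quantify this skew (a minimal sketch; it assumes the dataset has been loaded into the pandas DataFrame df, as done in the analysis section below):
# Positive skewness confirms the right-skewed shape; the top counts confirm the spike at 18
print(df['Age'].skew())
print(df['Age'].value_counts().head(3))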
Claim Severity Distribution
Low impact accidents: 70.0%
Medium impact accidents: 20.4%
High severity accidents: 9.6%
Premium Amount Analysis
The premium distribution is approximately normal, with a mean of about USD 2,220 and most values falling between USD 2,100 and USD 2,400.
There is a clear pattern showing that premium values increase with claim frequency, which aligns with typical insurance pricing models.
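This pattern can be checked directly by averaging premiums per claim count (a minimal sketch, again assuming the DataFrame df loaded below):
# Mean premium per claim count; the means should rise with claim frequency
print(df.groupby('Claims_Frequency')['Premium_Amount'].agg(['count', 'mean']).round(2))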
Conclusion
The findings suggest that:
There is a direct relationship between claim frequency and premium amount, and an inverse relationship between credit score and premium amount, suggesting that higher credit scores are associated with lower premiums.
General overview of the dataset¶
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("C:/Users/aov_f/Downloads/Data Science/Diploma in Data Analytics - Fitzwilliam Institute/Final project/synthetic_insurance_data.csv")
print(df.shape)
df.describe()
(10000, 27)
| | Age | Is_Senior | Married_Premium_Discount | Prior_Insurance_Premium_Adjustment | Claims_Frequency | Claims_Adjustment | Policy_Adjustment | Premium_Amount | Safe_Driver_Discount | Multi_Policy_Discount | ... | Total_Discounts | Time_Since_First_Contact | Conversion_Status | Website_Visits | Inquiries | Quotes_Requested | Time_to_Conversion | Credit_Score | Premium_Adjustment_Credit | Premium_Adjustment_Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 39.991700 | 0.159300 | 42.131400 | 47.625000 | 0.497200 | 36.780000 | -79.860000 | 2219.571400 | 0.199900 | 0.305100 | ... | 30.110000 | 15.478000 | 0.576700 | 5.022900 | 1.996900 | 1.996900 | 46.07320 | 714.253400 | -11.320000 | 64.325000 |
| std | 14.050358 | 0.365974 | 42.993376 | 34.354438 | 0.716131 | 65.910288 | 97.955806 | 148.521132 | 0.399945 | 0.460473 | ... | 33.689782 | 8.677975 | 0.494107 | 2.238231 | 1.415588 | 0.817409 | 45.44845 | 49.749487 | 48.704156 | 39.232618 |
| min | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -200.000000 | 1800.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 | 530.000000 | -50.000000 | 0.000000 |
| 25% | 29.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -200.000000 | 2100.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 8.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 6.00000 | 681.000000 | -50.000000 | 50.000000 |
| 50% | 39.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 0.000000 | 0.000000 | 2236.000000 | 0.000000 | 0.000000 | ... | 50.000000 | 16.000000 | 1.000000 | 5.000000 | 2.000000 | 2.000000 | 12.00000 | 715.000000 | -50.000000 | 50.000000 |
| 75% | 50.000000 | 0.000000 | 86.000000 | 50.000000 | 1.000000 | 50.000000 | 0.000000 | 2336.000000 | 0.000000 | 1.000000 | ... | 50.000000 | 23.000000 | 1.000000 | 6.000000 | 3.000000 | 3.000000 | 99.00000 | 748.000000 | 50.000000 | 100.000000 |
| max | 90.000000 | 1.000000 | 86.000000 | 100.000000 | 5.000000 | 800.000000 | 0.000000 | 2936.000000 | 1.000000 | 1.000000 | ... | 150.000000 | 30.000000 | 1.000000 | 16.000000 | 9.000000 | 3.000000 | 99.00000 | 850.000000 | 50.000000 | 100.000000 |
8 rows × 21 columns
First five rows¶
df.head(5)
| | Age | Is_Senior | Marital_Status | Married_Premium_Discount | Prior_Insurance | Prior_Insurance_Premium_Adjustment | Claims_Frequency | Claims_Severity | Claims_Adjustment | Policy_Type | ... | Time_Since_First_Contact | Conversion_Status | Website_Visits | Inquiries | Quotes_Requested | Time_to_Conversion | Credit_Score | Premium_Adjustment_Credit | Region | Premium_Adjustment_Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47 | 0 | Married | 86 | 1-5 years | 50 | 0 | Low | 0 | Full Coverage | ... | 10 | 0 | 5 | 1 | 2 | 99 | 704 | -50 | Suburban | 50 |
| 1 | 37 | 0 | Married | 86 | 1-5 years | 50 | 0 | Low | 0 | Full Coverage | ... | 22 | 0 | 5 | 1 | 2 | 99 | 726 | -50 | Urban | 100 |
| 2 | 49 | 0 | Married | 86 | 1-5 years | 50 | 1 | Low | 50 | Full Coverage | ... | 28 | 0 | 4 | 4 | 1 | 99 | 772 | -50 | Urban | 100 |
| 3 | 62 | 1 | Married | 86 | >5 years | 0 | 1 | Low | 50 | Full Coverage | ... | 4 | 1 | 6 | 2 | 2 | 2 | 809 | -50 | Urban | 100 |
| 4 | 36 | 0 | Single | 0 | >5 years | 0 | 2 | Low | 100 | Full Coverage | ... | 14 | 1 | 8 | 4 | 2 | 10 | 662 | 50 | Suburban | 50 |
5 rows × 27 columns
print(df.columns.tolist())
['Age', 'Is_Senior', 'Marital_Status', 'Married_Premium_Discount', 'Prior_Insurance', 'Prior_Insurance_Premium_Adjustment', 'Claims_Frequency', 'Claims_Severity', 'Claims_Adjustment', 'Policy_Type', 'Policy_Adjustment', 'Premium_Amount', 'Safe_Driver_Discount', 'Multi_Policy_Discount', 'Bundling_Discount', 'Total_Discounts', 'Source_of_Lead', 'Time_Since_First_Contact', 'Conversion_Status', 'Website_Visits', 'Inquiries', 'Quotes_Requested', 'Time_to_Conversion', 'Credit_Score', 'Premium_Adjustment_Credit', 'Region', 'Premium_Adjustment_Region']
Checking for missing values¶
df.isnull().sum()
Age                                   0
Is_Senior                             0
Marital_Status                        0
Married_Premium_Discount              0
Prior_Insurance                       0
Prior_Insurance_Premium_Adjustment    0
Claims_Frequency                      0
Claims_Severity                       0
Claims_Adjustment                     0
Policy_Type                           0
Policy_Adjustment                     0
Premium_Amount                        0
Safe_Driver_Discount                  0
Multi_Policy_Discount                 0
Bundling_Discount                     0
Total_Discounts                       0
Source_of_Lead                        0
Time_Since_First_Contact              0
Conversion_Status                     0
Website_Visits                        0
Inquiries                             0
Quotes_Requested                      0
Time_to_Conversion                    0
Credit_Score                          0
Premium_Adjustment_Credit             0
Region                                0
Premium_Adjustment_Region             0
dtype: int64
# No missing values were found above; dropna() is kept only as a safeguard
df = df.dropna()
Checking data type¶
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   Age                                 10000 non-null  int64
 1   Is_Senior                           10000 non-null  int64
 2   Marital_Status                      10000 non-null  object
 3   Married_Premium_Discount            10000 non-null  int64
 4   Prior_Insurance                     10000 non-null  object
 5   Prior_Insurance_Premium_Adjustment  10000 non-null  int64
 6   Claims_Frequency                    10000 non-null  int64
 7   Claims_Severity                     10000 non-null  object
 8   Claims_Adjustment                   10000 non-null  int64
 9   Policy_Type                         10000 non-null  object
 10  Policy_Adjustment                   10000 non-null  int64
 11  Premium_Amount                      10000 non-null  int64
 12  Safe_Driver_Discount                10000 non-null  int64
 13  Multi_Policy_Discount               10000 non-null  int64
 14  Bundling_Discount                   10000 non-null  int64
 15  Total_Discounts                     10000 non-null  int64
 16  Source_of_Lead                      10000 non-null  object
 17  Time_Since_First_Contact            10000 non-null  int64
 18  Conversion_Status                   10000 non-null  int64
 19  Website_Visits                      10000 non-null  int64
 20  Inquiries                           10000 non-null  int64
 21  Quotes_Requested                    10000 non-null  int64
 22  Time_to_Conversion                  10000 non-null  int64
 23  Credit_Score                        10000 non-null  int64
 24  Premium_Adjustment_Credit           10000 non-null  int64
 25  Region                              10000 non-null  object
 26  Premium_Adjustment_Region           10000 non-null  int64
dtypes: int64(21), object(6)
memory usage: 2.1+ MB
Table showing Regional split¶
# value_counts() already yields counts; to_frame(name='count') names the column explicitly
table = df['Region'].value_counts().to_frame(name='count')
print(table)
          count
Region
Urban      4921
Suburban   3023
Rural      2056
Regional distribution¶
df['Region'] = df['Region'].astype('category')
fig, ax = plt.subplots(figsize=(10, 6))
df['Region'].value_counts().plot(kind='bar', ax=ax, color='skyblue')
ax.set_xlabel('Region')
ax.set_ylabel('Count')
ax.set_title('Regional distribution')
plt.grid(True)
plt.show()
table = df['Region'].value_counts().to_frame(name='count')
ax = table.plot.pie(
autopct='%1.1f%%',
figsize=(8, 8),
title='Regional distribution - Pie Chart',
subplots = True
)
plt.show()
Histogram of Age distribution¶
sns.histplot(df.Age, kde=True,
bins=int(180/5),
color='darkblue',
edgecolor='black',
linewidth=1)
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
mode = df['Age'].mode()
print(mode)
0    18
Name: Age, dtype: int64
Correlation of Variables - Pair Plots (EDA Analysis)¶
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')
# Plot scatter matrix
scatter_matrix(numeric_df, figsize=(12, 12), diagonal='kde', alpha=0.5)
plt.suptitle("Pairwise Scatterplots (Pandas)", y=1.02)
plt.show()
numeric_df = df.select_dtypes(include='number')
sns.pairplot(numeric_df, diag_kind='kde', plot_kws={'alpha': 0.3, 's': 10})
plt.suptitle("Pairwise Scatterplots (Seaborn)", y=1.02)
plt.show()
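A pairplot over all 21 numeric columns is slow and hard to read; a focused subset (hand-picked here as an illustration, not part of the original analysis) is often clearer:
# Restrict the pairplot to a few variables central to the two hypotheses
subset = ['Premium_Amount', 'Claims_Frequency', 'Credit_Score', 'Total_Discounts', 'Age']
sns.pairplot(df[subset], diag_kind='kde', plot_kws={'alpha': 0.3, 's': 10})
plt.suptitle("Pairwise Scatterplots - Key Variables", y=1.02)
plt.show()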
correlation_matrix = df.corr(numeric_only=True)
# Display it
print(correlation_matrix)
[Output: the full 21 × 21 correlation matrix, visualized as a heatmap below. Its strongest pairwise correlations are:]

Conversion_Status / Time_to_Conversion              -0.998
Claims_Frequency / Claims_Adjustment                 0.804
Credit_Score / Premium_Adjustment_Credit            -0.788
Age / Is_Senior                                      0.695
Multi_Policy_Discount / Total_Discounts              0.677
Policy_Adjustment / Premium_Amount                   0.663
Safe_Driver_Discount / Total_Discounts               0.587
Claims_Adjustment / Premium_Amount                   0.439
Bundling_Discount / Total_Discounts                  0.430
Claims_Frequency / Premium_Amount                    0.355
Premium_Adjustment_Credit / Premium_Amount           0.326
Married_Premium_Discount / Premium_Amount            0.292
Premium_Adjustment_Region / Premium_Amount           0.266
Credit_Score / Premium_Amount                       -0.251
Prior_Insurance_Premium_Adjustment / Premium_Amount  0.235
Total_Discounts / Premium_Amount                    -0.229

[21 rows x 21 columns]
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Matrix")
plt.show()
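To focus on the target variable, the correlations with Premium_Amount can be ranked by absolute strength (a minimal sketch; the key argument to sort_values requires pandas 1.1 or later):
corr_premium = (
    df.corr(numeric_only=True)['Premium_Amount']
    .drop('Premium_Amount')                  # drop the trivial self-correlation
    .sort_values(key=abs, ascending=False)   # rank by absolute value
)
print(corr_premium.head(8))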
Histogram of Claims Frequency Distribution¶
sns.histplot(df.Claims_Frequency, kde=True,
bins=int(180/5),
color='darkblue',
edgecolor='black',
linewidth=1)
plt.title('Claims distribution')
plt.xlabel('Claims Frequency')
plt.ylabel('Count')
plt.grid(True)
plt.show()
# Use a categorical copy for the bar chart, so df['Claims_Frequency'] stays
# numeric for the regression in the hypothesis tests below
claims_cat = df['Claims_Frequency'].astype('category')
fig, ax = plt.subplots(figsize=(10, 6))
claims_cat.value_counts().plot(kind='bar', ax=ax, color='skyblue')
ax.set_xlabel('Claims Frequency')
ax.set_ylabel('Count')
ax.set_title('Claims distribution')
plt.grid(True)
plt.show()
Table showing the count of accidents categorized by severity¶
table = df['Claims_Severity'].value_counts().to_frame(name='count')
print(table)
                 count
Claims_Severity
Low               7003
Medium            2038
High               959
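The severity shares quoted in the overview (70.0% / 20.4% / 9.6%) follow directly from these counts:
# normalize=True converts counts to proportions
print((df['Claims_Severity'].value_counts(normalize=True) * 100).round(1))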
Claims categorized by severity - Pie Chart¶
ax = table.plot.pie(
autopct='%1.1f%%',
figsize=(8, 8),
title='Claims categorized by severity - Pie Chart',
subplots = True
)
plt.show()
Histogram showing Premium Amount distribution¶
plt.figure(figsize=(10, 6))
plt.hist(df['Premium_Amount'], bins=30, alpha=0.7)
plt.xlabel('Premium Amount')
plt.ylabel('Count')
plt.title('Histogram - Premium Amount')
plt.grid(True)
plt.show()
sns.histplot(df.Premium_Amount, kde=True,
bins=int(180/5),
color='darkblue',
edgecolor='black',
linewidth=1)
plt.title('Premium Amount distribution')
plt.xlabel('Premium Amount')
plt.ylabel('Count')
plt.grid(True)
plt.show()
Box plot of Premium Amount Distribution - Checking for outliers¶
plt.boxplot(df.Premium_Amount)
plt.title("Boxplot of Premium Amount distribution")
plt.show()
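The boxplot whiskers follow the 1.5 × IQR convention; the same rule can count outliers explicitly (a minimal sketch):
# Flag premiums outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['Premium_Amount'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Premium_Amount'] < q1 - 1.5 * iqr) | (df['Premium_Amount'] > q3 + 1.5 * iqr)]
print(f"Outliers by the 1.5*IQR rule: {len(outliers)}")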
Hypothesis 1: Ho (null hypothesis): Claims Frequency has no effect on the Premium Amount. Ha (alternative): Claims Frequency affects the Premium Amount¶
Split dataset into train (75%) and test (25%)¶
x = df[['Claims_Frequency']]
y = df['Premium_Amount']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
print(f"Training data size: {x_train.shape[0]}")
print(f"Testing data size: {x_test.shape[0]}")
Training data size: 7500
Testing data size: 2500
Training the Regression model¶
model = LinearRegression()
# Train the model using the training set
model.fit(x_train, y_train)
LinearRegression()
# Make predictions on the test set
y_pred = model.predict(x_test)
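The split above is never scored in the original run; a minimal sketch evaluating the fitted model on the held-out test set, using the metrics already imported:
# Held-out performance; should be close to the in-sample figures reported below
print(f"Test R²: {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")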
Checking correlation between Claims Frequency and Premium Amount¶
x = df['Claims_Frequency']
y = df['Premium_Amount']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Premium_Amount R-squared: 0.126
Model: OLS Adj. R-squared: 0.126
Method: Least Squares F-statistic: 1445.
Date: Wed, 07 May 2025 Prob (F-statistic): 1.78e-295
Time: 16:12:38 Log-Likelihood: -63521.
No. Observations: 10000 AIC: 1.270e+05
Df Residuals: 9998 BIC: 1.271e+05
Df Model: 1
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 2182.9269 1.690 1291.544 0.000 2179.614 2186.240
Claims_Frequency 73.7018 1.939 38.015 0.000 69.901 77.502
==============================================================================
Omnibus: 123.162 Durbin-Watson: 1.974
Prob(Omnibus): 0.000 Jarque-Bera (JB): 81.995
Skew: -0.089 Prob(JB): 1.57e-18
Kurtosis: 2.593 Cond. No. 1.94
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
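The slope and its 95% confidence interval can also be pulled from the fitted results programmatically rather than read off the table:
# Each additional claim is associated with roughly USD 74 more premium
print(model.params['Claims_Frequency'])
print(model.conf_int().loc['Claims_Frequency'])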
Testing model accuracy¶
# Prepare data
x = df[['Claims_Frequency']].values
y = df['Premium_Amount'].values
# Fit model
reg = LinearRegression()
reg.fit(x, y)
# Predictions
y_pred = reg.predict(x)
# Metrics
print(f"R² score: {reg.score(x, y)}")
print(f"Slope: {reg.coef_[0]}")
print(f"Intercept: {reg.intercept_}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred))}")
R² score: 0.12628866618004875
Slope: 73.70178966854829
Intercept: 2182.926870176798
RMSE: 138.81951368500023
# Predictions
y_pred = reg.predict(x)
# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, alpha=0.5, label='Actual Data')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('Claims Frequency')
plt.ylabel('Premium Amount')
plt.title('Linear Regression: Claims Frequency vs Premium Amount')
plt.legend()
plt.grid(True)
plt.show()
Hypothesis 2: Ho (null hypothesis): Credit Score has no effect on the Premium Amount. Ha (alternative): Credit Score affects the Premium Amount¶
Credit Score distribution¶
sns.histplot(df.Credit_Score, kde=True,
bins=int(180/5),
color='darkblue',
edgecolor='black',
linewidth=1)
plt.title('Credit Score distribution')
plt.xlabel('Credit Score')
plt.ylabel('Count')
plt.grid(True)
plt.show()
Split dataset into train(75%) and test(25%)¶
x = df[['Credit_Score']]
y = df['Premium_Amount']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
print(f"Training data size: {x_train.shape[0]}")
print(f"Testing data size: {x_test.shape[0]}")
Training data size: 7500
Testing data size: 2500
Training the regression model¶
model = sm.OLS(y_train, sm.add_constant(x_train)) # model creation
results = model.fit() # fitting the model
# Predict:
predictions = results.predict(sm.add_constant(x_test))
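As with Hypothesis 1, the held-out predictions can be scored before plotting (a minimal sketch using the names defined above):
# Test-set fit for the Credit_Score model
print(f"Test R²: {r2_score(y_test, predictions):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, predictions)):.2f}")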
plt.scatter(df['Credit_Score'], df['Premium_Amount'], alpha=0.5, label='Data')
# y_pred at this point still holds the Claims_Frequency predictions, so
# recompute the line from the Credit_Score model fitted above
x_line = pd.DataFrame({'Credit_Score': np.sort(df['Credit_Score'].unique())})
y_line = results.predict(sm.add_constant(x_line))
plt.plot(x_line['Credit_Score'], y_line, color='red', label='Regression Line')
plt.xlabel('Credit Score')
plt.ylabel('Premium Amount')
plt.title('Credit Score vs Premium Amount')
plt.legend()
plt.show()
x = df['Credit_Score']
y = df['Premium_Amount']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Premium_Amount R-squared: 0.063
Model: OLS Adj. R-squared: 0.063
Method: Least Squares F-statistic: 673.6
Date: Fri, 09 May 2025 Prob (F-statistic): 8.87e-144
Time: 13:38:50 Log-Likelihood: -63870.
No. Observations: 10000 AIC: 1.277e+05
Df Residuals: 9998 BIC: 1.278e+05
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 2755.2915 20.691 133.162 0.000 2714.732 2795.851
Credit_Score -0.7500 0.029 -25.954 0.000 -0.807 -0.693
==============================================================================
Omnibus: 40.404 Durbin-Watson: 1.982
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43.592
Skew: 0.122 Prob(JB): 3.42e-10
Kurtosis: 3.212 Cond. No. 1.03e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
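The condition-number warning stems from Credit_Score's raw scale (roughly 530 to 850) against the constant term, not from genuine multicollinearity; standardizing the predictor removes it, with the slope then read as the premium change per standard deviation of credit score (a minimal sketch):
# Refit with a standardized predictor; the condition number drops to ~1
x_std = (df['Credit_Score'] - df['Credit_Score'].mean()) / df['Credit_Score'].std()
model_std = sm.OLS(df['Premium_Amount'], sm.add_constant(x_std)).fit()
print(model_std.condition_number)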
# Prepare data
x = df[['Credit_Score']].values
y = df['Premium_Amount'].values
# Fit model
reg = LinearRegression()
reg.fit(x, y)
# Predictions
y_pred = reg.predict(x)
# Metrics
print(f"R² score: {reg.score(x, y)}")
print(f"Slope: {reg.coef_[0]}")
print(f"Intercept: {reg.intercept_}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred))}")
R² score: 0.06312071577216283
Slope: -0.7500420247872055
Intercept: 2755.2914663471456
RMSE: 143.75016505043345
# Predictions
y_pred = reg.predict(x)
# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, alpha=0.5, label='Actual Data')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('Credit Score')
plt.ylabel('Premium Amount')
plt.title('Linear Regression: Credit Score vs Premium Amount')
plt.legend()
plt.grid(True)
plt.show()